527 research outputs found
Variable selection and regression analysis for graph-structured covariates with an application to genomics
Graphs and networks are common ways of depicting biological information. In
biology, many different biological processes are represented by graphs, such as
regulatory networks, metabolic pathways and protein-protein interaction
networks. This kind of a priori use of graphs is a useful supplement to the
standard numerical data such as microarray gene expression data. In this paper
we consider the problem of regression analysis and variable selection when the
covariates are linked on a graph. We study a graph-constrained regularization
procedure and its theoretical properties for regression analysis to take into
account the neighborhood information of the variables measured on a graph. This
procedure involves a smoothness penalty on the coefficients that is defined as
a quadratic form of the Laplacian matrix associated with the graph. We
establish estimation and model selection consistency results and provide
estimation bounds for both fixed and diverging numbers of parameters in
regression models. We demonstrate by simulations and a real data set that the
proposed procedure can lead to better variable selection and prediction than
existing methods that ignore the graph information associated with the
covariates.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS332 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
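The smoothness penalty described in the abstract above, a quadratic form of the graph Laplacian, can be illustrated with a minimal sketch. It assumes the unnormalized Laplacian L = D - A (the paper may use a different normalization) and omits any sparsity-inducing term, so it shows only the smoothing part of the procedure. The function names are hypothetical and chosen for illustration.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A of an undirected graph (an assumption)."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

def smoothness_penalty(beta, laplacian):
    """Quadratic form beta^T L beta; equals the sum of (beta_i - beta_j)^2 over edges."""
    beta = np.asarray(beta, dtype=float)
    return float(beta @ laplacian @ beta)

def laplacian_ridge(X, y, laplacian, lam):
    """Minimize ||y - X beta||^2 + lam * beta^T L beta (no sparsity term)."""
    return np.linalg.solve(X.T @ X + lam * laplacian, X.T @ y)

# Path graph 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
L = graph_laplacian(A)
smooth = smoothness_penalty([1.0, 1.0, 1.0], L)   # 0.0: neighbors agree
rough = smoothness_penalty([1.0, -1.0, 1.0], L)   # 8.0: two edges, each (1 - (-1))^2
```

Coefficients that agree across linked covariates incur no penalty, while sign flips between neighbors are penalized, which is the sense in which the penalty uses neighborhood information.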
A sparse conditional Gaussian graphical model for analysis of genetical genomics data
Genetical genomics experiments are now routinely conducted to measure
both the genetic markers and gene expression data on the same subjects. The
gene expression levels are often treated as quantitative traits and are subject
to standard genetic analysis in order to identify expression
quantitative trait loci (eQTL). However, the genetic architecture for many gene
expressions may be complex, and poorly estimated genetic architecture may
compromise the inferences of the dependency structures of the genes at the
transcriptional level. In this paper we introduce a sparse conditional Gaussian
graphical model for studying the conditional independence relationships among a
set of gene expressions adjusting for possible genetic effects where the gene
expressions are modeled with seemingly unrelated regressions. We present an
efficient coordinate descent algorithm to obtain the penalized estimation of
both the regression coefficients and the sparse concentration matrix. The
corresponding graph can be used to determine the conditional independence among
a group of genes while adjusting for shared genetic effects. Simulation
experiments, asymptotic convergence rates and sparsistency results are used to
justify the proposed methods. By sparsistency, we mean the property that all
parameters that are zero are actually estimated as zero with probability
tending to one. We apply our methods to the analysis of a yeast eQTL data set
and demonstrate that the conditional Gaussian graphical model leads to a more
interpretable gene network than a standard Gaussian graphical model based on
gene expression data alone.
Comment: Published at http://dx.doi.org/10.1214/11-AOAS494 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
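The abstract above mentions a coordinate descent algorithm for penalized estimation. As a rough illustration only, the sketch below applies coordinate descent to a plain lasso regression, which is a much simpler problem than the joint estimation of regression coefficients and a sparse concentration matrix that the paper addresses. The function names and the simulated data are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) * x_j^T x_j for each column
    resid = y - X @ beta                   # current full residual
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]     # remove coordinate j from the fit
            rho = X[:, j] @ resid / n      # correlation with partial residual
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]     # add the updated coordinate back
    return beta

# Simulated data (illustrative only): the second true coefficient is zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=50)
beta_hat = lasso_cd(X, y, lam=0.1)
```

The soft-thresholding step is what sets small coefficients exactly to zero, which is the mechanism behind the sparsistency property defined in the abstract.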
A hidden spatial-temporal Markov random field model for network-based analysis of time course gene expression data
Microarray time course (MTC) gene expression data are commonly collected to
study the dynamic nature of biological processes. One important problem is to
identify genes that show different expression profiles over time and pathways
that are perturbed during a given biological process. While methods are
available to identify the genes with differential expression levels over time,
there is a lack of methods that can incorporate the pathway information in
identifying the pathways being modified/activated during a biological process.
In this paper we develop a hidden spatial-temporal Markov random field
(hstMRF)-based method for identifying genes and subnetworks that are related to
biological processes, where the dependency of the differential expression
patterns of genes on the networks is modeled over time and over the network of
pathways. Simulation studies indicated that the method is quite effective in
identifying genes and modified subnetworks and has higher sensitivity than the
commonly used procedures that do not use the pathway structure or time
dependency information, with similar false discovery rates. Application to a
microarray gene expression study of systemic inflammation in humans identified
a core set of genes on the KEGG pathways that show clear differential
expression patterns over time. In addition, the method confirmed that the
TOLL-like signaling pathway plays an important role in immune response to
endotoxins.
Comment: Published at http://dx.doi.org/10.1214/07--AOAS145 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Transfer Learning with Random Coefficient Ridge Regression
Ridge regression with random coefficients provides an important alternative
to fixed coefficients regression in high-dimensional settings when the effects
are expected to be small but nonzero. This paper considers estimation and
prediction of random coefficient ridge regression in the setting of transfer
learning, where in addition to observations from the target model, source
samples from different but possibly related regression models are available.
The informativeness of the source model to the target model can be quantified
by the correlation between the regression coefficients. This paper proposes two
estimators of regression coefficients of the target model as the weighted sum
of the ridge estimates of both target and source models, where the weights can
be determined by minimizing the empirical estimation risk or prediction risk.
Using random matrix theory, the limiting values of the optimal weights are
derived under the setting when $p/n \to \gamma$, where $p$ is the
number of the predictors and $n$ is the sample size, which leads to an explicit
expression of the estimation or prediction risks. Simulations show that these
limiting risks agree very well with the empirical risks. An application to
predicting the polygenic risk scores for lipid traits shows such transfer
learning methods lead to smaller prediction errors than the single sample ridge
regression or Lasso-based transfer learning.
Comment: 16 pages, 5 figures
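The weighted combination of target and source ridge estimates described in the abstract above can be sketched as follows. This is a simplified stand-in, not the paper's estimator: the weights here are chosen by least squares on held-out target data rather than by the risk formulas or random-matrix limits the paper derives. All names, the penalty value and the simulated data are assumptions made for illustration.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def combine_weights(beta_t, beta_s, X_val, y_val):
    """Weights for the target and source estimates, fitted by least squares
    on held-out target data (a stand-in for the paper's risk minimization)."""
    preds = np.column_stack([X_val @ beta_t, X_val @ beta_s])
    w, *_ = np.linalg.lstsq(preds, y_val, rcond=None)
    return w

rng = np.random.default_rng(0)
n, p, lam = 200, 50, 5.0
# Small, nonzero effects; source coefficients are correlated with the target's.
beta_target = rng.normal(scale=0.1, size=p)
beta_source = 0.9 * beta_target + rng.normal(scale=0.03, size=p)

def simulate(beta, n_obs):
    X = rng.normal(size=(n_obs, p))
    return X, X @ beta + rng.normal(size=n_obs)

X_t, y_t = simulate(beta_target, n)          # target sample
X_s, y_s = simulate(beta_source, 5 * n)      # larger source sample
X_val, y_val = simulate(beta_target, n)      # held-out target data

b_t = ridge(X_t, y_t, lam)
b_s = ridge(X_s, y_s, lam)
w = combine_weights(b_t, b_s, X_val, y_val)
b_comb = w[0] * b_t + w[1] * b_s             # weighted sum of ridge estimates
```

Because the weights (1, 0) are always available, the combined estimate can do no worse than the target-only ridge fit on the data used to choose the weights; how much it helps on new data depends on how closely the source coefficients track the target's.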
Censored Data Regression in High-Dimension and Low-Sample Size Settings For Genomic Applications
New high-throughput technologies are generating various types of high-dimensional genomic and proteomic data and meta-data (e.g., networks and pathways) in order to obtain a systems-level understanding of various complex diseases such as human cancers and cardiovascular diseases. As the amount and complexity of the data increase and as the questions being addressed become more sophisticated, we face the great challenge of how to model such data in order to draw valid statistical and biological conclusions. One important problem in genomic research is to relate these high-throughput genomic data to various clinical outcomes, including possibly censored survival outcomes such as age at disease onset or time to cancer recurrence. We review some recently developed methods for censored data regression in the high-dimension and low-sample size setting, with emphasis on applications to genomic data. These methods include dimension reduction-based methods, regularized estimation methods such as the Lasso and the threshold gradient descent method, gradient descent boosting methods and nonparametric pathways-based regression models. These methods are demonstrated and compared by analysis of a data set of microarray gene expression profiles of 240 patients with diffuse large B-cell lymphoma together with follow-up survival information. Areas of further research are also presented.